Abstract 

We extend the theory of boosting for regression problems to the online learning setting. 
Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled 
as an online learning algorithm with linear loss functions that competes with a base class of 
regression functions, while a strong learning algorithm is an online learning algorithm with smooth 
convex loss functions that competes with a larger class of regression functions. Our main result 
is an online gradient boosting algorithm that converts a weak online learning algorithm into a 
strong one where the larger class of functions is the linear span of the base class. We also give 
a simpler boosting algorithm that converts a weak online learning algorithm into a strong one 
where the larger class of functions is the convex hull of the base class, and prove its optimality. 


1 Introduction 


Boosting algorithms [21] are ensemble methods that convert a learning algorithm for a base class of 
models with weak predictive power, such as decision trees, into a learning algorithm for a class of 
models with stronger predictive power, such as a weighted majority vote over base models in the case 
of classification, or a linear combination of base models in the case of regression. 

Boosting methods such as AdaBoost [9] and Gradient Boosting [10] have found tremendous 
practical application, especially using decision trees as the base class of models. These algorithms were 
developed in the batch setting, where training is done over a fixed batch of sample data. However, 
with the recent explosion of huge data sets which do not fit in main memory, training in the batch 
setting is infeasible, and online learning techniques which train a model in one pass over the data have 
proven extremely useful. 

A natural goal therefore is to extend boosting algorithms to the online learning setting. Indeed, 
there has already been some work on online boosting for classification problems (300E1SS11. 
Of these, the work by Chen et al. [4] provided the first theoretical study of online boosting for 
classification, which was later generalized by Beygelzimer et al. Q to obtain optimal and adaptive 
online boosting algorithms. 

However, extending boosting algorithms for regression to the online setting has been elusive and 
escaped theoretical guarantees thus far. In this paper, we rigorously formalize the setting of online 
boosting for regression and then extend the very commonly used gradient boosting methods [30 
to the online setting, providing theoretical guarantees on their performance. 

The main result of this paper is an online boosting algorithm that competes with any linear 
combination of the base functions, given an online linear learning algorithm over the base class. This 
algorithm is the online analogue of the batch boosting algorithm of Zhang and Yu [24], and in fact our 
algorithmic technique, when specialized to the batch boosting setting, provides exponentially better 
convergence guarantees. 

We also give an online boosting algorithm that competes with the best convex combination of 
base functions. This is a simpler algorithm which is analyzed along the lines of the Frank-Wolfe 
algorithm [8]. While the algorithm has weaker theoretical guarantees, it can still be useful in practice. 
We also prove that this algorithm obtains the optimal regret bound (up to constant factors) for this 
setting. 

Finally, we conduct some proof-of-concept experiments which show that our online boosting 
algorithms do obtain performance improvements over different classes of base learners. 


1.1 Related Work 

While the theory of boosting for classification in the batch setting is well-developed (see ill), the 
theory of boosting for regression is comparatively sparse. The foundational theory of boosting for 
regression can be found in the statistics literature [14, 13], where boosting is understood as a greedy 
stagewise algorithm for fitting of additive models. The goal is to achieve the performance of linear 
combinations of base models, and to prove convergence to the performance of the best such linear 
combination. 

While the earliest works on boosting for regression such as 0 do not have such convergence 
proofs, later works such as 00 do have convergence proofs but without a bound on the speed 
of convergence. Bounds on the speed of convergence have been obtained by Duffy and Helmbold 
[7] relying on a somewhat strong assumption on the performance of the base learning algorithm. 
A different approach to boosting for regression was taken by Freund and Schapire [9], who give 
an algorithm that reduces the regression problem to classification and then applies AdaBoost; the 
corresponding proof of convergence relies on an assumption on the induced classification problem 
which may be hard to satisfy in practice. The strongest result is that of Zhang and Yu [24], who 
prove convergence to the performance of the best linear combination of base functions, along with a 
bound on the rate of convergence, making essentially no assumptions on the performance of the base 
learning algorithm. Telgarsky [22] proves similar results for logistic (or similar) loss using a slightly 
simpler boosting algorithm. 


The results in this paper are a generalization of the results of Zhang and Yu [24] to the online 
setting. However, we emphasize that this generalization is nontrivial and requires different algorithmic 
ideas and proof techniques. Indeed, we were not able to directly generalize the analysis in [24] by 
simply adapting the techniques used in recent online boosting work Hi , but we made use of the 
classical Frank-Wolfe algorithm 0. On the other hand, while an important part of the convergence 
analysis for the batch setting is to show statistical consistency of the algorithms 0S0, in the 
online setting we only need to study the empirical convergence (that is, the regret), which makes our 
analysis much more concise. 


2 Setup 

Examples are chosen from a feature space $\mathcal{X}$, and the prediction space is $\mathbb{R}^d$. Let $\|\cdot\|$ denote some 
norm in $\mathbb{R}^d$. In the setting for online regression, in each round $t$ for $t = 1, 2, \ldots, T$, an adversary 
selects an example $x_t \in \mathcal{X}$ and a loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$, and presents $x_t$ to the online learner. The 
online learner outputs a prediction $y_t \in \mathbb{R}^d$, obtains the loss function $\ell_t$, and incurs loss $\ell_t(y_t)$. 

Let $\mathcal{F}$ denote a reference class of regression functions $f : \mathcal{X} \to \mathbb{R}^d$, and let $\mathcal{C}$ denote a class of loss 
functions $\ell : \mathbb{R}^d \to \mathbb{R}$. Also, let $R : \mathbb{N} \to \mathbb{R}_+$ be a non-decreasing function. We say that the function 
class $\mathcal{F}$ is online learnable for losses in $\mathcal{C}$ with regret $R$ if there is an online learning algorithm $\mathcal{A}$ 
that, for every $T \in \mathbb{N}$ and every sequence $(x_t, \ell_t) \in \mathcal{X} \times \mathcal{C}$ for $t = 1, 2, \ldots, T$ chosen by the adversary, 
generates predictions$^1$ $\mathcal{A}(x_t) \in \mathbb{R}^d$ such that 

\[
\sum_{t=1}^{T} \ell_t(\mathcal{A}(x_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f(x_t)) + R(T). \tag{1}
\]


If the online learning algorithm is randomized, we require the above bound to hold with high 
probability. 

The above definition is simply the online generalization of standard empirical risk minimization 
(ERM) in the batch setting. A concrete example is 1-dimensional regression, i.e. the prediction space 
is $\mathbb{R}$. For a labeled data point $(x, y^*) \in \mathcal{X} \times \mathbb{R}$, the loss for the prediction $y \in \mathbb{R}$ is given by $\ell(y^*, y)$, 
where $\ell(\cdot, \cdot)$ is a fixed loss function that is convex in the second argument (such as squared loss, 
logistic loss, etc). Given a batch of $T$ labeled data points $\{(x_t, y_t^*) \mid t = 1, 2, \ldots, T\}$ and a base class 
of regression functions $\mathcal{F}$ (say, the set of bounded norm linear regressors), an ERM algorithm finds 
the function $f \in \mathcal{F}$ that minimizes $\sum_{t=1}^{T} \ell(y_t^*, f(x_t))$. 

In the online setting, the adversary reveals the data $(x_t, y_t^*)$ in an online fashion, only presenting 
the true label $y_t^*$ after the online learner $\mathcal{A}$ has chosen a prediction $y_t$. Thus, setting $\ell_t(y_t) = \ell(y_t^*, y_t)$, 
we observe that if $\mathcal{A}$ satisfies the regret bound (1), then it makes predictions with total loss almost as 
small as that of the empirical risk minimizer, up to the regret term. If $\mathcal{F}$ is the set of all bounded-norm 
linear regressors, for example, the algorithm $\mathcal{A}$ could be online gradient descent [25] or online Newton 
Step 0- 
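
As a rough illustration of the regret definition (1), the following sketch runs online gradient descent over one-dimensional bounded-norm linear regressors and compares its cumulative loss with that of the best fixed regressor in hindsight. The data generator, the step size $1/\sqrt{t}$, the half-squared loss, and all function names are assumptions made for this example only.

```python
import random

def half_squared_loss(y_true, y_pred):
    return 0.5 * (y_pred - y_true) ** 2

def ogd_total_loss(data, radius=1.0):
    """Cumulative loss of online gradient descent over f_w(x) = w * x, |w| <= radius."""
    w, total = 0.0, 0.0
    for t, (x, y_true) in enumerate(data, start=1):
        y_pred = w * x                      # predict before the label is revealed
        total += half_squared_loss(y_true, y_pred)
        grad_w = (y_pred - y_true) * x      # gradient of the loss with respect to w
        w -= grad_w / t ** 0.5              # assumed step size 1 / sqrt(t)
        w = max(-radius, min(radius, w))    # project back onto [-radius, radius]
    return total

def best_fixed_total_loss(data, radius=1.0):
    """Cumulative loss of the best fixed w in hindsight (the ERM solution)."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    # In one dimension, clipping the unconstrained least-squares solution
    # gives the constrained minimizer of the convex quadratic objective.
    w_star = 0.0 if sxx == 0 else max(-radius, min(radius, sxy / sxx))
    return sum(half_squared_loss(y, w_star * x) for x, y in data)

def make_data(T, seed=0):
    """Synthetic stream (hypothetical): y = 0.5 * x + noise, clipped to [-1, 1]."""
    rng = random.Random(seed)
    data = []
    for _ in range(T):
        x = rng.uniform(-1.0, 1.0)
        y = max(-1.0, min(1.0, 0.5 * x + rng.gauss(0.0, 0.1)))
        data.append((x, y))
    return data

data = make_data(1000)
regret = ogd_total_loss(data) - best_fixed_total_loss(data)
```

Here `regret` plays the role of the difference between the two sums in (1) for this particular sequence; a sublinear regret function $R(T)$ means this quantity grows more slowly than $T$.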

At a high level, in the batch setting, “boosting” is understood as a procedure that, given a batch of 
data and access to an ERM algorithm for a function class $\mathcal{F}$ (this is called a “weak” learner), obtains 
an approximate ERM algorithm for a richer function class $\mathcal{F}'$ (this is called a “strong” learner). 
Generally, $\mathcal{F}'$ is the set of finite linear combinations of functions in $\mathcal{F}$. The efficiency of boosting is 
measured by how many times, $N$, the base ERM algorithm needs to be called (i.e., the number of 
boosting steps) to obtain an ERM algorithm for the richer function class within the desired approximation 
tolerance. Convergence rates IT! give bounds on how quickly the approximation error goes to 0 as 
$N \to \infty$. 

We now extend this notion of boosting to the online setting in the natural manner. To capture 
the full generality of the techniques, we also specify a class of loss functions that the online learning 
algorithm can work with. Informally, an online boosting algorithm is a reduction that, given access 
to an online learning algorithm $\mathcal{A}$ for a function class $\mathcal{F}$ and loss function class $\mathcal{L}$ with regret $R$, and a 
bound $N$ on the total number of calls made in each iteration to copies of $\mathcal{A}$, obtains an online learning 
algorithm $\mathcal{A}'$ for a richer function class $\mathcal{F}'$, a richer loss function class $\mathcal{C}$, and (possibly larger) regret 
$R'$. The bound $N$ on the total number of calls made to all the copies of $\mathcal{A}$ corresponds to the number 
of boosting stages in the batch setting, and in the online setting it may be viewed as a resource 
constraint on the algorithm. The efficacy of the reduction is measured by $R'$, which is a function of 
$R$, $N$, and certain parameters of the comparator class $\mathcal{F}'$ and loss function class $\mathcal{C}$. We desire online 
boosting algorithms such that $\frac{1}{T} R'(T) \to 0$ quickly as $N \to \infty$ and $T \to \infty$. We make the notions of 
richness in the above informal description more precise now. 


Comparator function classes. A given function class $\mathcal{F}$ is said to be $D$-bounded if for all $x \in \mathcal{X}$ 
and all $f \in \mathcal{F}$, we have $\|f(x)\| \le D$. Throughout this paper, we assume that $\mathcal{F}$ is symmetric:$^2$ i.e. if 
$f \in \mathcal{F}$, then $-f \in \mathcal{F}$, and it contains the constant zero function, which we denote, with some abuse 
of notation, by 0. 

1 There is a slight abuse of notation here. $\mathcal{A}(\cdot)$ is not a function but rather the output of the online learning algorithm 
$\mathcal{A}$ computed on the given example using its internal state. 

2 This is without loss of generality; as will be seen momentarily, our base assumption only requires an online learning 
algorithm $\mathcal{A}$ for $\mathcal{F}$ for linear losses $\ell_t$. By running the Hedge algorithm on two copies of $\mathcal{A}$, one of which receives the 
actual loss functions $\ell_t$ and the other receives $-\ell_t$, we get an algorithm which competes with negations of functions 
in $\mathcal{F}$ and the constant zero function as well. Furthermore, since the loss functions are convex (indeed, linear) this can 
be made into a deterministic reduction by choosing the convex combination of the outputs of the two copies of $\mathcal{A}$ with 
mixing weights given by the Hedge algorithm. 

Given $\mathcal{F}$, we define two richer function classes $\mathcal{F}'$: the convex hull of $\mathcal{F}$, denoted $\mathrm{CH}(\mathcal{F})$, is 
the set of convex combinations of a finite number of functions in $\mathcal{F}$, and the span of $\mathcal{F}$, denoted 
$\mathrm{span}(\mathcal{F})$, is the set of linear combinations of finitely many functions in $\mathcal{F}$. For any $f \in \mathrm{span}(\mathcal{F})$, 
define 

\[
\|f\|_1 := \inf\Bigl\{ \max\Bigl\{1, \sum_{g \in S} |w_g|\Bigr\} \;:\; f = \sum_{g \in S} w_g g,\ S \subseteq \mathcal{F},\ |S| < \infty,\ w_g \in \mathbb{R} \Bigr\}.
\]

Since functions in $\mathrm{span}(\mathcal{F})$ are not bounded, it is not possible to obtain a uniform regret bound for all 
functions in $\mathrm{span}(\mathcal{F})$: rather, the regret of an online learning algorithm $\mathcal{A}$ for $\mathrm{span}(\mathcal{F})$ is specified in 
terms of regret bounds for individual comparator functions $f \in \mathrm{span}(\mathcal{F})$, viz. 

\[
R_f(T) := \sum_{t=1}^{T} \ell_t(\mathcal{A}(x_t)) - \sum_{t=1}^{T} \ell_t(f(x_t)).
\]


Loss function classes. The base loss function class we consider is $\mathcal{L}$, the set of all linear functions 
$\ell : \mathbb{R}^d \to \mathbb{R}$ with Lipschitz constant bounded by 1. A function class $\mathcal{F}$ that is online learnable with 
the loss function class $\mathcal{L}$ is called online linear learnable for short. The richer loss function class we 
consider is denoted by $\mathcal{C}$ and is a set of convex loss functions $\ell : \mathbb{R}^d \to \mathbb{R}$ satisfying some regularity 
conditions specified in terms of certain parameters described below. 

We define a few parameters of the class $\mathcal{C}$. For any $b > 0$, let $\mathbb{B}^d(b) = \{y \in \mathbb{R}^d : \|y\| \le b\}$ be 
the ball of radius $b$. The class $\mathcal{C}$ is said to have Lipschitz constant $L_b$ on $\mathbb{B}^d(b)$ if for all $\ell \in \mathcal{C}$ and all 
$y \in \mathbb{B}^d(b)$ there is an efficiently computable subgradient $\nabla\ell(y)$ with norm at most $L_b$. Next, $\mathcal{C}$ is said 
to be $\beta_b$-smooth on $\mathbb{B}^d(b)$ if for all $\ell \in \mathcal{C}$ and all $y, y' \in \mathbb{B}^d(b)$ we have 

\[
\ell(y') \le \ell(y) + \nabla\ell(y) \cdot (y' - y) + \frac{\beta_b}{2}\|y - y'\|^2.
\]

Next, define the projection operator $\Pi_b : \mathbb{R}^d \to \mathbb{B}^d(b)$ as $\Pi_b(y) := \arg\min_{y' \in \mathbb{B}^d(b)} \|y - y'\|$, and define 

\[
\epsilon_b := \sup_{y \in \mathbb{R}^d :\, \Pi_b(y) \ne y,\ \ell \in \mathcal{C}} \frac{\ell(\Pi_b(y)) - \ell(y)}{\|\Pi_b(y) - y\|}.
\]
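
To make the projection operator and the parameter $\epsilon_b$ more concrete, here is a minimal sketch. It assumes the Euclidean norm (the text allows any norm), estimates $\epsilon_b$ for a single scalar loss on a finite grid rather than taking the supremum over a whole class, and uses hypothetical helper names.

```python
import math

def project_ball(y, b):
    """Euclidean projection of the vector y onto the ball of radius b."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= b:
        return list(y)
    return [v * b / norm for v in y]

def estimate_eps(loss, b, span=5.0, steps=2000):
    """Grid estimate of eps_b for one scalar loss: the largest decrease in loss
    per unit distance moved when a point outside [-b, b] is projected back."""
    best = float("-inf")
    for k in range(1, steps + 1):
        for sign in (-1.0, 1.0):
            y = sign * (b + span * k / steps)   # a point strictly outside [-b, b]
            p = project_ball([y], b)[0]
            best = max(best, (loss(p) - loss(y)) / abs(p - y))
    return best

# Linear loss with label y* = 1, i.e. l(y) = -y: projecting from the right
# increases the loss by exactly the distance moved, so the estimate is 1.
eps_linear = estimate_eps(lambda y: -y, b=1.0)
```

Because the estimate uses one loss and a finite grid, it is only a lower bound on the supremum in the definition; it is meant to convey what $\epsilon_b$ measures, not to compute it exactly.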

3 Online Boosting Algorithms 

The setup is that we are given a $D$-bounded reference class of functions $\mathcal{F}$ with an online linear 
learning algorithm $\mathcal{A}$ with regret bound $R(\cdot)$. For normalization, we also assume that the output of 
$\mathcal{A}$ at any time is bounded in norm by $D$, i.e. $\|\mathcal{A}(x_t)\| \le D$ for all $t$. We further assume that for every 
$b > 0$, we can compute$^3$ a Lipschitz constant $L_b$, a smoothness parameter $\beta_b$, and the parameter $\epsilon_b$ 
for the class $\mathcal{C}$ over $\mathbb{B}^d(b)$. Furthermore, the online boosting algorithm may make up to $N$ calls per 
iteration to any copies of $\mathcal{A}$ it maintains, for a given budget parameter $N$. 

Given this setup, our main result is an online boosting algorithm, Algorithm 1, competing with 
$\mathrm{span}(\mathcal{F})$. The algorithm maintains $N$ copies of $\mathcal{A}$, denoted $\mathcal{A}^i$ for $i = 1, 2, \ldots, N$. Each copy 
corresponds to one stage in boosting. When it receives a new example $x_t$, it passes it to each $\mathcal{A}^i$ 
and obtains their predictions $\mathcal{A}^i(x_t)$, which it then combines into a prediction for $y_t$ using a linear 
combination. At the most basic level, this linear combination is simply the sum of all the predictions 
scaled by a step size parameter $\eta$. Two tweaks are made to this sum in step 8 to facilitate the analysis: 

1. While constructing the sum, the partial sum $y_t^{i-1}$ is multiplied by a shrinkage factor $(1 - \sigma_t^i \eta)$. 
This shrinkage term is tuned using an online gradient descent algorithm in step 14. The goal of 
the tuning is to induce the partial sums $y_t^{i-1}$ to be aligned with a descent direction for the loss 
functions, as measured by the inner product $\nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1}$. 

3 It suffices to compute upper bounds on these parameters. 

Algorithm 1 Online Gradient Boosting for $\mathrm{span}(\mathcal{F})$ 

Require: Number of weak learners $N$, step size parameter $\eta \in [\frac{1}{N}, 1]$. 
 1: Let $B = \min\{\eta N D,\ \inf\{b \ge D : \eta \beta_b b^2 \ge \epsilon_b D\}\}$. 
 2: Maintain $N$ copies of the algorithm $\mathcal{A}$, denoted $\mathcal{A}^i$ for $i = 1, 2, \ldots, N$. 
 3: For each $i$, initialize $\sigma_1^i = 0$. 
 4: for $t = 1$ to $T$ do 
 5:   Receive example $x_t$. 
 6:   Define $y_t^0 = 0$. 
 7:   for $i = 1$ to $N$ do 
 8:     Define $y_t^i = \Pi_B\bigl((1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta\, \mathcal{A}^i(x_t)\bigr)$. 
 9:   end for 
10:   Predict $y_t = y_t^N$. 
11:   Obtain loss function $\ell_t$ and suffer loss $\ell_t(y_t)$. 
12:   for $i = 1$ to $N$ do 
13:     Pass loss function $\ell_t^i(y) = \frac{1}{L_B} \nabla\ell_t(y_t^{i-1}) \cdot y$ to $\mathcal{A}^i$. 
14:     Set $\sigma_{t+1}^i = \max\{\min\{\sigma_t^i + \alpha_t \nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1},\ 1\},\ 0\}$, where $\alpha_t = \frac{1}{L_B B \sqrt{t}}$. 
15:   end for 
16: end for 


2. The partial sums $y_t^i$ are made to lie in $\mathbb{B}^d(B)$, for some parameter $B$, by using the projection 
operator $\Pi_B$. This is done to ensure that the Lipschitz constant and smoothness of the loss 
function are suitably bounded. 

Once the boosting algorithm makes the prediction $y_t$ and obtains the loss function $\ell_t$, each $\mathcal{A}^i$ is 
updated using a suitably scaled linear approximation to the loss function at the partial sum $y_t^{i-1}$, i.e. 
the linear loss function $\frac{1}{L_B} \nabla\ell_t(y_t^{i-1}) \cdot y$. This forces $\mathcal{A}^i$ to produce predictions that are aligned with 
a descent direction for the loss function. 
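
The description above can be sketched in code for 1-dimensional regression. This is a minimal illustration under several assumptions that are not taken from the text: the weak learner is online gradient descent over $f_w(x) = w x$ with $|w| \le D$ and $|x| \le 1$, the loss is the half-squared loss, $B$ is simply set to $\eta N D$ (so the projection never changes the partial sums), and the class and function names are hypothetical.

```python
import random

def clip(v, lo, hi):
    return max(lo, min(hi, v))

class LinearWeakLearner:
    """Assumed weak learner: online gradient descent over f_w(x) = w * x, |w| <= D,
    fed linear losses of the form y -> g * y."""
    def __init__(self, D=1.0):
        self.D, self.w, self.t = D, 0.0, 0

    def predict(self, x):
        return self.w * x

    def update(self, x, g):
        self.t += 1
        # Gradient of w -> g * (w * x) is g * x; step size D / sqrt(t) is an assumption.
        self.w = clip(self.w - self.D * g * x / self.t ** 0.5, -self.D, self.D)

class SpanBooster:
    """Sketch of the boosting loop: shrink, add a scaled weak prediction, project."""
    def __init__(self, N, eta, D, L_B):
        self.eta, self.B, self.L_B = eta, eta * N * D, L_B
        self.learners = [LinearWeakLearner(D) for _ in range(N)]
        self.sigma = [0.0] * N
        self.t = 0
        self.partial = [0.0]

    def predict(self, x):
        self.partial = [0.0]                                   # y_t^0 = 0
        for i, learner in enumerate(self.learners):
            z = (1 - self.sigma[i] * self.eta) * self.partial[-1] \
                + self.eta * learner.predict(x)
            self.partial.append(clip(z, -self.B, self.B))      # projection (step 8)
        return self.partial[-1]

    def update(self, x, grad):
        """grad(y) returns the derivative of the round's loss at y."""
        self.t += 1
        alpha = 1.0 / (self.L_B * self.B * self.t ** 0.5)
        for i, learner in enumerate(self.learners):
            y_prev = self.partial[i]
            g = grad(y_prev)
            learner.update(x, g / self.L_B)                    # scaled linear loss (step 13)
            self.sigma[i] = clip(self.sigma[i] + alpha * g * y_prev, 0.0, 1.0)  # step 14

def run(T=3000, N=40, eta=0.1, seed=0):
    rng = random.Random(seed)
    # With B = eta * N and labels in [-2, 2], |y - y*| <= B + 2 bounds the gradient.
    booster = SpanBooster(N, eta, D=1.0, L_B=eta * N + 2.0)
    boosted_loss = base_loss = 0.0
    for _ in range(T):
        x = rng.uniform(-1.0, 1.0)
        y_true = 2.0 * x             # target lies in the span but outside the base class
        y_hat = booster.predict(x)
        boosted_loss += 0.5 * (y_hat - y_true) ** 2
        base_loss += 0.5 * (x - y_true) ** 2   # best single base function, w = 1
        booster.update(x, lambda y: y - y_true)
    return boosted_loss, base_loss, booster
```

In this toy stream the target $2x$ cannot be represented by any single base function, so comparing `boosted_loss` with `base_loss` gives a rough sense of what competing with the span buys; the parameter values are illustrative rather than tuned.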

We provide the analysis of the algorithm in Section 4.2. The analysis yields the following regret 
bound for the algorithm: 

Theorem 1. Let $\eta \in [\frac{1}{N}, 1]$ be a given parameter. Let $B = \min\{\eta N D,\ \inf\{b \ge D : \eta \beta_b b^2 \ge \epsilon_b D\}\}$. 
Algorithm 1 is an online learning algorithm for $\mathrm{span}(\mathcal{F})$ and losses in $\mathcal{C}$ with the following regret bound 
for any $f \in \mathrm{span}(\mathcal{F})$: 

\[
R'_f(T) \le \Bigl(1 - \frac{\eta}{\|f\|_1}\Bigr)^{N} \Delta_0 + 3\eta\beta_B B^2 \|f\|_1 T + L_B \|f\|_1 R(T) + 2 L_B B \|f\|_1 \sqrt{T},
\]

where $\Delta_0 := \sum_{t=1}^{T} \ell_t(0) - \ell_t(f(x_t))$. 

The regret bound in this theorem depends on several parameters such as $B$, $\beta_B$ and $L_B$. In 
applications of the algorithm for 1-dimensional regression with commonly used loss functions, however, 
these parameters are essentially modest constants; see Section 3.1 for calculations of the parameters 
for various loss functions. Furthermore, if $\eta$ is appropriately set (e.g. $\eta = (\log N)/N$), then the 
average regret $R'_f(T)/T$ clearly converges to 0 as $N \to \infty$ and $T \to \infty$. While the requirement that 
$N \to \infty$ may raise concerns about computational efficiency, this is in fact analogous to the guarantee 
in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. 
Moreover, our lower bound (Theorem 3) shows that this is indeed necessary. 
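
The convergence claim for $\eta = (\log N)/N$ can be checked with a short calculation. The sketch below takes the bound of Theorem 1 in the form $(1 - \eta/\|f\|_1)^N \Delta_0 + 3\eta\beta_B B^2\|f\|_1 T + L_B\|f\|_1 R(T) + 2L_B B\|f\|_1\sqrt{T}$ and assumes, for illustration, that $B$, $\beta_B$ and $L_B$ are treated as constants, that $\Delta_0 / T$ is bounded, and that $R(T)$ is sublinear in $T$.

```latex
% Uses 1 - x <= e^{-x}, valid here since 0 <= \eta / \|f\|_1 <= 1.
\[
\Bigl(1 - \frac{\eta}{\|f\|_1}\Bigr)^{N}
  \le \exp\Bigl(-\frac{\eta N}{\|f\|_1}\Bigr)
  = \exp\Bigl(-\frac{\log N}{\|f\|_1}\Bigr)
  = N^{-1/\|f\|_1}.
\]
% Dividing the regret bound by T with \eta = (\log N)/N:
\[
\frac{R'_f(T)}{T}
  \le N^{-1/\|f\|_1}\,\frac{\Delta_0}{T}
  + 3\beta_B B^2 \|f\|_1\,\frac{\log N}{N}
  + L_B \|f\|_1\,\frac{R(T)}{T}
  + \frac{2 L_B B \|f\|_1}{\sqrt{T}}.
\]
```

Under the stated assumptions, the first two terms vanish as $N \to \infty$ and the last two vanish as $T \to \infty$.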

We also present a simpler boosting algorithm, Algorithm 2, that competes with $\mathrm{CH}(\mathcal{F})$. 
Algorithm 2 is similar to Algorithm 1, with some simplifications: the final prediction is simply a convex 
combination of the predictions of the base learners, with no projections or shrinkage necessary. While 
Algorithm 1 is more general, Algorithm 2 may still be useful in practice when a bound on the norm of 
the comparator function is known in advance, using the observations in Section 5.2. Furthermore, its 
analysis is cleaner and easier to understand for readers who are familiar with the Frank-Wolfe method, 
and this serves as a foundation for the analysis of Algorithm 1. This algorithm has an optimal (up to 
constant factors) regret bound as given in the following theorem, proved in Section 4.1. The upper 
bound in this theorem is proved along the lines of the Frank-Wolfe [8] algorithm, and the lower bound 
using information-theoretic arguments. 

Theorem 2. Algorithm 2 is an online learning algorithm for $\mathrm{CH}(\mathcal{F})$ for losses in $\mathcal{C}$ with the regret 
bound 

\[
R'(T) \le \frac{8\beta_D D^2}{N}\, T + L_D R(T).
\]

Furthermore, the dependence of this regret bound on $N$ is optimal up to constant factors. 

The dependence of the regret bound on $R(T)$ is unimprovable without additional assumptions: 
otherwise, Algorithm 2 will be an online linear learning algorithm over $\mathcal{F}$ with better than $R(T)$ 
regret. 


Algorithm 2 Online Gradient Boosting for $\mathrm{CH}(\mathcal{F})$ 

1: Maintain N copies of the algorithm A , denoted A 1 , A 2 ,..., A N , and let rji = for i = 
1,2,...,A. 

 2: for $t = 1$ to $T$ do 
 3:   Receive example $x_t$. 
 4:   Define $y_t^0 = 0$. 
 5:   for $i = 1$ to $N$ do 
 6:     Define $y_t^i = (1 - \eta_i)\, y_t^{i-1} + \eta_i\, \mathcal{A}^i(x_t)$. 
 7:   end for 
 8:   Predict $y_t = y_t^N$. 
 9:   Obtain loss function $\ell_t$ and suffer loss $\ell_t(y_t)$. 
10:   for $i = 1$ to $N$ do 
11:     Pass loss function $\ell_t^i(y) = \frac{1}{L_D} \nabla\ell_t(y_t^{i-1}) \cdot y$ to $\mathcal{A}^i$. 
12:   end for 
13: end for 
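
A minimal sketch of this convex-combination scheme for 1-dimensional regression follows. Several choices here are assumptions made for illustration: the step sizes are taken to be $\eta_i = 2/(i+1)$, a standard Frank-Wolfe schedule with $\eta_1 = 1$; the weak learner is online gradient descent over $f_w(x) = w x$ with $|w| \le D$ and $|x| \le 1$; the loss is the half-squared loss; and all names are hypothetical.

```python
import random

def clip(v, lo, hi):
    return max(lo, min(hi, v))

class LinearWeakLearner:
    """Assumed weak learner: online gradient descent over f_w(x) = w * x, |w| <= D,
    fed linear losses of the form y -> g * y."""
    def __init__(self, D=1.0):
        self.D, self.w, self.t = D, 0.0, 0

    def predict(self, x):
        return self.w * x

    def update(self, x, g):
        self.t += 1
        self.w = clip(self.w - self.D * g * x / self.t ** 0.5, -self.D, self.D)

class ConvexHullBooster:
    """Sketch of the convex-combination booster; no projection or shrinkage."""
    def __init__(self, N, D, L_D):
        self.L_D = L_D
        # Assumed Frank-Wolfe style schedule: eta_i = 2 / (i + 1), so eta_1 = 1.
        self.etas = [2.0 / (i + 1) for i in range(1, N + 1)]
        self.learners = [LinearWeakLearner(D) for _ in range(N)]
        self.partial = [0.0]

    def predict(self, x):
        self.partial = [0.0]                                   # y_t^0 = 0
        for eta_i, learner in zip(self.etas, self.learners):
            self.partial.append((1 - eta_i) * self.partial[-1]
                                + eta_i * learner.predict(x))  # step 6
        return self.partial[-1]

    def update(self, x, grad):
        """grad(y) returns the derivative of the round's loss at y."""
        for i, learner in enumerate(self.learners):
            learner.update(x, grad(self.partial[i]) / self.L_D)  # step 11

def run_convex_hull_demo(T=2000, N=50, seed=0):
    rng = random.Random(seed)
    booster = ConvexHullBooster(N, D=1.0, L_D=2.0)   # |y - y*| <= 2 on [-1, 1]
    total, max_abs = 0.0, 0.0
    for _ in range(T):
        x = rng.uniform(-1.0, 1.0)
        y_true = 0.8 * x                             # comparator inside the base class
        y_hat = booster.predict(x)
        total += 0.5 * (y_hat - y_true) ** 2
        max_abs = max(max_abs, abs(y_hat))
        booster.update(x, lambda y: y - y_true)
    return total, max_abs
```

Because every partial sum is a convex combination of weak predictions and zero, the booster's output stays inside the $D$-ball without any projection, which is the simplification relative to Algorithm 1 that the text points out.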


Using a deterministic base online linear learning algorithm. If the base online linear learning 
algorithm $\mathcal{A}$ is deterministic, then our results can be improved, because our online boosting algorithms 
are also deterministic, and using a standard simple reduction, we can now allow $\mathcal{C}$ to be any set of 
convex functions (smooth or not) with a computable Lipschitz constant $L_b$ over the domain $\mathbb{B}^d(b)$ for 
any $b > 0$. 

This reduction converts arbitrary convex loss functions into linear functions: viz. if $y_t$ is the 
output of the online boosting algorithm, then the loss function provided to the boosting algorithm 
as feedback is the linear function $\ell'_t(y) = \nabla\ell_t(y_t) \cdot y$. This reduction immediately implies that the 
base online linear learning algorithm $\mathcal{A}$, when fed loss functions $\frac{1}{L_D} \ell'_t$, is already an online learning 
algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$ with the regret bound $R'(T) \le L_D R(T)$. 
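
The reason the linearization works is the subgradient inequality: for a convex loss, the gap measured by the linear surrogate upper-bounds the true gap. The sketch below illustrates this for the non-smooth absolute loss in one dimension; the choice of loss and the helper names are assumptions made for the example.

```python
import random

def abs_loss(y_true, y):
    return abs(y - y_true)

def abs_subgradient(y_true, y):
    """A subgradient of y -> |y - y_true| (0 is chosen at the kink)."""
    if y > y_true:
        return 1.0
    if y < y_true:
        return -1.0
    return 0.0

def true_gap(y_true, y_t, y):
    """Loss of the prediction y_t minus loss of a comparator value y."""
    return abs_loss(y_true, y_t) - abs_loss(y_true, y)

def linearized_gap(y_true, y_t, y):
    """The same gap measured with the linear surrogate l'(u) = g * u,
    where g is a subgradient of the loss at the prediction y_t."""
    g = abs_subgradient(y_true, y_t)
    return g * y_t - g * y

rng = random.Random(0)
triples = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
           for _ in range(1000)]
# Convexity: the linearized gap is never smaller than the true gap.
holds = all(true_gap(a, b, c) <= linearized_gap(a, b, c) + 1e-12
            for a, b, c in triples)
```

Summing this inequality over rounds shows that regret against the linear surrogates upper-bounds regret against the original convex losses, which is why smoothness is not needed in the deterministic case.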

As for competing with $\mathrm{span}(\mathcal{F})$, since linear loss functions are 0-smooth, we obtain the following 
easy corollary of Theorem 1: 

Corollary 1. Let $\eta \in [\frac{1}{N}, 1]$ be a given parameter, and set $B = \eta N D$. Algorithm 1 is an online 
learning algorithm for $\mathrm{span}(\mathcal{F})$ for losses in $\mathcal{C}$ with the following regret bound for any $f \in \mathrm{span}(\mathcal{F})$: 

\[
R'_f(T) \le \Bigl(1 - \frac{\eta}{\|f\|_1}\Bigr)^{N} \Delta_0 + L_B \|f\|_1 R(T) + 2 L_B B \|f\|_1 \sqrt{T},
\]

where $\Delta_0 := \sum_{t=1}^{T} \ell_t(0) - \ell_t(f(x_t))$. 

3.1 The parameters for several basic loss functions 

In this section we consider the application of our results to 1-dimensional regression, where we assume, 
for normalization, that the true labels of the examples and the predictions of the functions in the class 
$\mathcal{F}$ are in $[-1, 1]$. In this case $\|\cdot\|$ denotes the absolute value norm. Thus, in each round, the adversary 
chooses a labeled data point $(x_t, y_t^*) \in \mathcal{X} \times [-1, 1]$, and the loss for the prediction $y_t \in [-1, 1]$ is given 
by $\ell_t(y_t) = \ell(y_t^*, y_t)$ where $\ell(\cdot, \cdot)$ is a fixed loss function that is convex in the second argument. Note 
that $D = 1$ in this setting. We give examples of several such loss functions below, and compute the 
parameters $L_b$, $\beta_b$ and $\epsilon_b$ for every $b > 0$, as well as $B$ from Theorem 1. 

1. Linear loss: $\ell(y^*, y) = -y^* y$. We have $L_b = 1$, $\beta_b = 0$, $\epsilon_b = 1$, and $B = \eta N$. 

2. $p$-norm loss, for some $p \ge 2$: $\ell(y^*, y) = |y^* - y|^p$. We have $L_b = p(b+1)^{p-1}$, $\beta_b = p(p-1)(b+1)^{p-2}$, 
$\epsilon_b = \max\{p(1-b)^{p-1}, 0\}$, and $B = 1$. 

3. Modified least squares: $\ell(y^*, y) = \frac{1}{2} \max\{1 - y^* y, 0\}^2$. We have $L_b = b + 1$, $\beta_b = 1$, $\epsilon_b = 
\max\{1 - b, 0\}$, and $B = 1$. 

4. Logistic loss: £{y*,y) = ln(l + exp {-y*y)). We have L b = /3 b = e b = , and 

$B = \min\{\eta N, \ln(4/\eta)\}$. 
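
As a sanity check on constants of this kind, the following sketch compares the Lipschitz and smoothness constants stated for the $p$-norm loss with the largest first and second derivatives found on a grid over $y \in [-b, b]$ and $y^* \in [-1, 1]$. The grid resolution and function names are assumptions; the check covers only $L_b$ and $\beta_b$.

```python
def p_norm_constants(p, b):
    """Lipschitz and smoothness constants stated for |y* - y|^p on [-b, b]."""
    lipschitz = p * (b + 1) ** (p - 1)
    smoothness = p * (p - 1) * (b + 1) ** (p - 2)
    return lipschitz, smoothness

def grid_maxima(p, b, steps=100):
    """Largest |first derivative| and second derivative of y -> |y* - y|^p
    over a grid of predictions y in [-b, b] and labels y* in [-1, 1]."""
    max_grad = max_curv = 0.0
    for i in range(steps + 1):
        y = -b + 2 * b * i / steps
        for j in range(steps + 1):
            y_true = -1 + 2 * j / steps
            d = abs(y - y_true)
            max_grad = max(max_grad, p * d ** (p - 1))
            max_curv = max(max_curv, p * (p - 1) * d ** (p - 2))
    return max_grad, max_curv

# Example: p = 3 on the ball of radius b = 1.
L_b, beta_b = p_norm_constants(3, 1.0)
observed_grad, observed_curv = grid_maxima(3, 1.0)
```

Both maxima are attained at the corner $y = b$, $y^* = -1$, where $|y - y^*| = b + 1$, which is where the factor $(b+1)$ in the stated constants comes from.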


4 Analysis 

In this section, we analyze Algorithms 1 and 2. 


4.1 Competing with convex combinations of the base functions 

We give the analysis of Algorithm 2 before that of Algorithm 1 since it is easier to understand and 
provides the foundation for the analysis of Algorithm 1. 

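Proof of Theorem 2. First, note that for any $i = 1, 2, \ldots, N$, since $\ell_t^i$ is a linear function, we have 

\[
\inf_{f \in \mathrm{CH}(\mathcal{F})} \sum_{t=1}^{T} \ell_t^i(f(x_t)) = \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t^i(f(x_t)).
\]

Let $f$ be any function in $\mathrm{CH}(\mathcal{F})$. The equality above and the fact that $\mathcal{A}^i$ is an online learning 
algorithm for $\mathcal{F}$ with regret bound $R(\cdot)$ for the 1-Lipschitz linear loss functions $\ell_t^i(y) = \frac{1}{L_D} \nabla\ell_t(y_t^{i-1}) \cdot y$ 
imply that 

\[
\sum_{t=1}^{T} \frac{1}{L_D} \nabla\ell_t(y_t^{i-1}) \cdot \mathcal{A}^i(x_t) \le \sum_{t=1}^{T} \frac{1}{L_D} \nabla\ell_t(y_t^{i-1}) \cdot f(x_t) + R(T). \tag{2}
\]

Now define, for $i = 0, 1, 2, \ldots, N$, $\Delta_i = \sum_{t=1}^{T} \ell_t(y_t^i) - \ell_t(f(x_t))$. We have 

\begin{align*}
\Delta_i &= \sum_{t=1}^{T} \ell_t\bigl(y_t^{i-1} + \eta_i(\mathcal{A}^i(x_t) - y_t^{i-1})\bigr) - \ell_t(f(x_t)) \\
&\le \sum_{t=1}^{T} \Bigl[ \ell_t(y_t^{i-1}) - \ell_t(f(x_t)) + \eta_i \nabla\ell_t(y_t^{i-1}) \cdot (\mathcal{A}^i(x_t) - y_t^{i-1}) + \frac{\beta_D \eta_i^2}{2} \|\mathcal{A}^i(x_t) - y_t^{i-1}\|^2 \Bigr] \\
&\qquad \text{(by $\beta_D$-smoothness of $\mathcal{C}$)} \\
&\le \sum_{t=1}^{T} \Bigl[ \ell_t(y_t^{i-1}) - \ell_t(f(x_t)) + \eta_i \nabla\ell_t(y_t^{i-1}) \cdot (f(x_t) - y_t^{i-1}) + 2\eta_i^2 \beta_D D^2 \Bigr] + \eta_i L_D R(T) \\
&\qquad \text{(by (2) and using the bound $\|\mathcal{A}^i(x_t) - y_t^{i-1}\| \le 2D$)} \\
&\le \sum_{t=1}^{T} \Bigl[ \ell_t(y_t^{i-1}) - \ell_t(f(x_t)) - \eta_i\bigl(\ell_t(y_t^{i-1}) - \ell_t(f(x_t))\bigr) + 2\eta_i^2 \beta_D D^2 \Bigr] + \eta_i L_D R(T) \\
&\qquad \text{(by convexity, $\ell_t(y_t^{i-1}) + \nabla\ell_t(y_t^{i-1}) \cdot (f(x_t) - y_t^{i-1}) \le \ell_t(f(x_t))$)} \\
&\le (1 - \eta_i)\Delta_{i-1} + 2\eta_i^2 \beta_D D^2 T + \eta_i L_D R(T).
\end{align*}

For $i = 1$, since $\eta_1 = 1$, the above bound implies that $\Delta_1 \le 2\beta_D D^2 T + L_D R(T)$. Starting from this 
base case, an easy induction on $i \ge 1$ proves that $\Delta_i \le \frac{8\beta_D D^2}{i} T + L_D R(T)$. Applying this bound for 
$i = N$ completes the proof. □ 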
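
The induction step can be written out as follows. This is a sketch that assumes, for illustration, the step sizes $\eta_i = 2/(i+1)$ (a standard Frank-Wolfe schedule satisfying $\eta_1 = 1$), and it uses the shorthand $C := \beta_D D^2 T$ and $\rho := L_D R(T)$, which are introduced here and are not notation from the text. It starts from the recursion $\Delta_i \le (1 - \eta_i)\Delta_{i-1} + 2\eta_i^2 C + \eta_i \rho$.

```latex
% Claim: \Delta_i \le 8C/i + \rho for all i \ge 1.
% Base case: \Delta_1 \le 2C + \rho \le 8C + \rho.
% Inductive step for i \ge 2, assuming \Delta_{i-1} \le 8C/(i-1) + \rho:
\begin{align*}
\Delta_i
  &\le \Bigl(1 - \frac{2}{i+1}\Bigr)\Bigl(\frac{8C}{i-1} + \rho\Bigr)
     + \frac{8C}{(i+1)^2} + \frac{2\rho}{i+1} \\
  &= \frac{8C}{i+1} + \frac{8C}{(i+1)^2} + \rho
   = 8C \cdot \frac{i+2}{(i+1)^2} + \rho \\
  &\le \frac{8C}{i} + \rho,
  \qquad \text{since } i(i+2) \le (i+1)^2 .
\end{align*}
```

Setting $i = N$ gives $\Delta_N \le \frac{8\beta_D D^2}{N} T + L_D R(T)$ under the assumed schedule.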

We now show that the dependence of the regret bound of Algorithm 2 on the parameter $N$ is 
optimal up to constant factors. 

Theorem 3. Let $N$ be any specified bound on the total number of calls in each iteration to all copies 
of the base online linear learning algorithm. Then there is a setting of 1-dimensional prediction with 
a 1-bounded comparator function class $\mathcal{F}$, an online linear optimization algorithm $\mathcal{A}$ over $\mathcal{F}$, and a 
class $\mathcal{C}$ of loss functions that is 1-smooth on $\mathbb{R}$ such that any online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ 
with losses in $\mathcal{C}$ respecting the bound $N$ has regret at least $\Omega(T/N)$. 

Proof. Consider the following construction. At a high level, the setting is 1-dimensional regression 
with C corresponding to squared loss. The domain X = N and true labels of examples are in [0,1]. 

Define p 1 = ^ + e and P 2 = \ — e, where e = TcfTjv’ anc ^ ^ ^ >1 an< ^ ^ 2 ^ w0 distributions over 
$\{0,1\}^N$ where each bit is a Bernoulli random variable with parameter $p_1$ and $p_2$ respectively, chosen 
independently of the other bits. Consider a sequence of examples $(x_t, y_t^*) \in \mathbb{N} \times [0,1]$ generated as 
follows: $x_t = t$, and the label $y_t^*$ is chosen from $\{p_1, p_2\}$ uniformly at random in each round. 

Let for c = _ The function class J- consists of a large number, M = i N , of functions fi, 

$i \in [M]$. For each $t$ and $i$, we set $f_i(x_t) = 1$ w.p. $y_t^*$ and $0$ w.p. $1 - y_t^*$ independently of all other 
values of $t$ and $i$. 

The base online linear learning algorithm $\mathcal{A}$ is simply Hedge over the $M$ functions. In each round, 
the Hedge algorithm selects one of the $M$ functions in $\mathcal{F}$ and uses that to predict the label, and for 
any sequence of $T$ examples, with high probability, incurs regret $R(T) = O(\sqrt{\log(M)\, T})$. 

We set $\mathcal{C}$ to be the set of squared loss functions, i.e. functions of the form $\ell(y) = \frac{1}{2}(y - y^*)^2$ for 
$y^* \in [0,1]$. Note that these loss functions are 1-smooth and $D = 1$. In round $t$, the loss function is 
$\ell_t(y) = \frac{1}{2}(y - y_t^*)^2$. 

Consider the function f = jj fi> which is in CH(J r ). Given any input sequence (x t , yf) for 

t = 1, 2, ..., T it is easy to calculate that E[i(/(x t ) — y$) 2 ] = Vt , and since the examples 

and predictions of functions on the examples are independent across iterations, a simple application 
of the multiplicative Chernoff bound implies that if T > 12 M, then with probability at least 0.9, we 

haveELi W( x t) ~yt ) 2 < w- 

Now suppose there is an online boosting algorithm making at most $N$ calls total to all copies of 
$\mathcal{A}$ in each iteration, that for any large enough $T$ and for any sequence $(x_t, y_t^*)$ for $t = 1, 2, \ldots, T$, 
outputs predictions y t such that with high probability, say at least 0.9, we have Y^t =1 |(l/t ~ Vt) 2 — 
J2t=i ^(/(x t ) — j/j ) 2 + ^. Then by a union bound, with probability at least 0.8, we have J2t= i \(Vt ~ 
Vt) 2 — w + JT — HT- By Markov’s inequality and a union bound, with probability at least 0.7, for 
a uniform random time r £ [T], we have 




2 ’ 


(3) 


or in other words, $y_\tau$ is on the same side of $\frac{1}{2}$ as $y_\tau^*$, and thus can be used to identify $y_\tau^*$. In the rest 
of the proof, we will use this fact, along with the fact that the total variation distance between $D_1$ and $D_2$, 
denoted $d_{TV}(D_1, D_2)$, is small, to derive a contradiction. 

Define the random variable $Y : \{0,1\}^N \to \mathbb{R}$ as follows. For any bit string $s = (s_1, s_2, \ldots, s_N) \in 
\{0,1\}^N$, choose a random round $\tau \in [T]$, and simulate the online boosting process until round $\tau - 1$ by 
sampling $y_t^*$'s and the outputs of $f_i(x_t)$ for all $t \le \tau - 1$ and $i \in [M]$ from the appropriate distributions. 
In round $\tau$, let $f_{i_1}, f_{i_2}, \ldots, f_{i_N}$ be the functions that are obtained from the at most $N$ calls to copies of 
$\mathcal{A}$ (there could be repetitions). Assign $f_{i_j}(x_\tau) = s_j$ for $j \in [N]$ (being careful with repeated functions 
and repeating outputs appropriately), and run the booster with these outputs to obtain $y_\tau$, and set 
$Y(s) = y_\tau$. Let $\Pr[\cdot]$ denote the probability of events in this process for generating $Y(s)$ given $s$. 

Let $E_1[X(s)]$ and $E_2[X(s)]$ denote expectation of a random variable $X : \{0,1\}^N \to \mathbb{R}$ when $s$ 
is drawn from $D_1$ and $D_2$ respectively, and let $E_0[X(I, s)]$ denote expectation of a random variable 
$X : \{1,2\} \times \{0,1\}^N \to \mathbb{R}$ when $I$ is chosen from $\{1,2\}$ uniformly at random and then $s$ is sampled 
from $D_I$. The above analysis (inequality (3)) implies that 


\[
0.7 \le E_0\bigl[\Pr[|Y(s) - p_I| < \epsilon]\bigr] = \tfrac{1}{2} E_1\bigl[\Pr[|Y(s) - p_1| < \epsilon]\bigr] + \tfrac{1}{2} E_2\bigl[\Pr[|Y(s) - p_2| < \epsilon]\bigr].
\]

Now define a random variable $X : \{0,1\}^N \to \mathbb{R}$ as $X(s) = \Pr[Y(s) > \frac{1}{2}]$. Since 

\[
\Pr[Y(s) > \tfrac{1}{2}] \ge \Pr[|Y(s) - p_1| < \epsilon] \quad \text{and} \quad 1 - \Pr[Y(s) > \tfrac{1}{2}] \ge \Pr[|Y(s) - p_2| < \epsilon],
\]

we conclude, using the above bound, that $E_1[X(s)] - E_2[X(s)] \ge 0.4$. This is a contradiction, because 
since $X(s) \in [0,1]$, we have 


Ei[X(s)]-Ea[A'(a)] < d TY {D 1 ,D 2 ) < = 0.4, 

where the bound on $d_{TV}(D_1, D_2)$ is standard; see e.g. [15]. This gives us the desired contradiction. 

□ 


The above result can be easily extended to any given parameters $\beta$ and $D$ so that $\mathcal{F}$ is $D$-bounded 
and $\mathcal{C}$ is $\beta$-smooth on $\mathbb{R}$, giving a lower bound of $\Omega\bigl(\frac{\beta D^2}{N} T\bigr)$ on the regret of an online boosting 
algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$: we simply scale all function and label values by $D$, and consider 
the loss functions $\ell(y, y^*) = \frac{\beta}{2}(y - y^*)^2$. If there were an online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with 
these loss functions with regret $o\bigl(\frac{\beta D^2}{N} T\bigr)$, then by scaling down the predictions by $D$, we obtain an 
online boosting algorithm for exactly the setting in the proof of Theorem 3 with a regret bound of 
$o\bigl(\frac{T}{N}\bigr)$, which is a contradiction. 


4.2 Competing with the span of the base functions 

In this section we show that Algorithm 1 satisfies the regret bound claimed in Theorem 1. 

Proof of Theorem 1. Let $f = \sum_{g \in S} w_g g$ for some finite subset $S$ of $\mathcal{F}$, where $w_g \in \mathbb{R}$. Since $\mathcal{F}$ is 
symmetric, we may assume that all $w_g \ge 0$, and let $W := \sum_{g \in S} w_g$. Furthermore, we may assume that 
$0 \in S$ with weight $w_0 = \max\{1 - \sum_{g \in S,\, g \ne 0} w_g,\ 0\}$, so that $W \ge 1$. Note that $\|f\|_1$ is exactly the 
infimum of $W$ over all such ways of expressing $f$ as a finite weighted sum of functions in $\mathcal{F}$. We now 
prove that the bound stated in the theorem holds with $\|f\|_1$ replaced by $W$; the theorem then follows 
simply by taking the infimum of the bound over all such ways of expressing $f$. 
Now, for each $i \in [N]$, the update in line 14 of Algorithm 1 is exactly online gradient descent [25] 
on the domain $[0,1]$ with linear loss functions $\sigma \mapsto -\nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1}\, \sigma$. Note that the derivative of 
this loss function is bounded as follows: $|-\nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1}| \le L_B B$. Since $\frac{1}{W} \in [0,1]$, the standard 
analysis of online gradient descent then implies that the sequence $\sigma_t^i$ for $t = 1, 2, \ldots, T$ satisfies 

\[
\sum_{t=1}^{T} -\nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1}\, \sigma_t^i \;\le\; \sum_{t=1}^{T} -\nabla\ell_t(y_t^{i-1}) \cdot y_t^{i-1}\, \frac{1}{W} + 2 L_B B \sqrt{T}. \tag{4}
\]

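Next, since $f = \sum_{g \in S} w_g g$ with $w_g \ge 0$, we have 

\[
\frac{1}{W} \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot f(x_t) = \sum_{g \in S} \frac{w_g}{W} \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot g(x_t) \;\ge\; \min_{g \in S} \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot g(x_t). \tag{5}
\]

Let $g^* \in \arg\min_{g \in S} \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot g(x_t)$. Since $\mathcal{A}^i$ is an online learning algorithm for $\mathcal{F}$ with regret 
bound $R(\cdot)$ for the 1-Lipschitz linear loss functions $\ell_t^i(y) = \frac{1}{L_B} \nabla\ell_t(y_t^{i-1}) \cdot y$, and $g^* \in \mathcal{F}$, multiplying 
the regret bound (1) by $L_B$ we have 

\[
\sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot \mathcal{A}^i(x_t) \le \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot g^*(x_t) + L_B R(T) \le \frac{1}{W} \sum_{t=1}^{T} \nabla\ell_t(y_t^{i-1}) \cdot f(x_t) + L_B R(T) \tag{6}
\]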

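by (5). Now, we analyze how much excess loss is potentially introduced due to the projection in line 8. 
First, note that if $B = \eta N D$, then the projection has no effect since $(1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta \mathcal{A}^i(x_t) \in \mathbb{B}^d(B)$, 
and in this case $\ell_t(y_t^i) = \ell_t\bigl((1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta \mathcal{A}^i(x_t)\bigr)$. If $B < \eta N D$, then by the definition of $B$, 
$\eta \beta_B B^2 \ge \epsilon_B D$, and since $(1 - \sigma_t^i \eta)\, y_t^{i-1} \in \mathbb{B}^d(B)$ and $\|\eta \mathcal{A}^i(x_t)\| \le \eta D$, we have 

\[
\ell_t(y_t^i) = \ell_t\bigl(\Pi_B((1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta \mathcal{A}^i(x_t))\bigr) \le \ell_t\bigl((1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta \mathcal{A}^i(x_t)\bigr) + \eta \epsilon_B D.
\]

In either case, we have 

\[
\ell_t(y_t^i) \le \ell_t\bigl((1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta \mathcal{A}^i(x_t)\bigr) + \eta^2 \beta_B B^2. \tag{7}
\]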


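We now move to the main part of the analysis. Define for $i = 0, 1, 2, \ldots, N$, $\Delta_i := \sum_{t=1}^T \ell_t(y_t^i) - \ell_t(f(x_t))$. We have

$$\begin{aligned}
\Delta_i &\le \sum_{t=1}^T \left[\ell_t\big((1 - \sigma_t^i\eta)y_t^{i-1} + \eta\mathcal{A}^i(x_t)\big) - \ell_t(f(x_t))\right] + \eta^2\beta_B B^2 T && \text{(by (7))} \\
&\le \Delta_{i-1} + \sum_{t=1}^T \left[\eta\nabla\ell_t(y_t^{i-1}) \cdot \big(\mathcal{A}^i(x_t) - \sigma_t^i y_t^{i-1}\big) + \frac{\beta_B\eta^2}{2}\big\|\mathcal{A}^i(x_t) - \sigma_t^i y_t^{i-1}\big\|^2\right] + \eta^2\beta_B B^2 T && \text{(by $\beta_B$-smoothness)} \\
&\le \Delta_{i-1} + \sum_{t=1}^T \frac{\eta}{W}\nabla\ell_t(y_t^{i-1}) \cdot \big(f(x_t) - y_t^{i-1}\big) + 3\eta^2\beta_B B^2 T + \eta L_B R(T) + 2\eta L_B B\sqrt{T} \\
&\le \left(1 - \frac{\eta}{W}\right)\Delta_{i-1} + 3\eta^2\beta_B B^2 T + \eta L_B R(T) + 2\eta L_B B\sqrt{T},
\end{aligned}$$

where the third inequality is by (4), (6) and the fact that $\|\mathcal{A}^i(x_t) - \sigma_t^i y_t^{i-1}\| \le D + B \le 2B$, and the last holds
since, by convexity of $\ell_t$, we have $\ell_t(y_t^{i-1}) + \nabla\ell_t(y_t^{i-1}) \cdot (f(x_t) - y_t^{i-1}) \le \ell_t(f(x_t))$. Applying the above
bound iteratively, we get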

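$$\begin{aligned}
\Delta_N &\le \left(1 - \frac{\eta}{W}\right)^N \Delta_0 + \sum_{i=1}^N \left(1 - \frac{\eta}{W}\right)^{i-1} \cdot \left(3\eta^2\beta_B B^2 T + \eta L_B R(T) + 2\eta L_B B\sqrt{T}\right) \\
&\le \left(1 - \frac{\eta}{W}\right)^N \Delta_0 + 3\eta\beta_B B^2 W T + L_B W R(T) + 2 L_B B W \sqrt{T}.
\end{aligned}$$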

This completes the proof. □ 
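The update structure analyzed in this proof (the projected combination in line 8, the linear loss passed to each copy in line 13, and the $\sigma$ update in line 14, as those lines are described above) can be sketched for scalar predictions as follows. This is an illustration under stated assumptions, not the authors' implementation: `LinearBaseLearner` is a hypothetical weak learner (online gradient descent over linear functions with weight norm at most 1), and the values of `B`, `L_B`, and `eta` are picked by hand for a squared loss on bounded targets rather than derived from the definition of $B$.

```python
import math
import random

def clip(v, lo, hi):
    return max(lo, min(hi, v))

class LinearBaseLearner:
    """Hypothetical weak learner: OGD over {x -> w.x : ||w|| <= 1} for linear losses."""
    def __init__(self, dim):
        self.w = [0.0] * dim
        self.t = 0

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, c):
        # Linear loss y -> c * y; its gradient with respect to w is c * x.
        self.t += 1
        step = 1.0 / math.sqrt(self.t)
        self.w = [wi - step * c * xi for wi, xi in zip(self.w, x)]
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        if norm > 1.0:
            self.w = [wi / norm for wi in self.w]

class OnlineGradientBooster:
    def __init__(self, dim, n_learners, eta, B, L_B):
        self.learners = [LinearBaseLearner(dim) for _ in range(n_learners)]
        self.sigma = [0.0] * n_learners
        self.eta, self.B, self.L_B = eta, B, L_B
        self.t = 0
        self.partials = [0.0]

    def predict(self, x):
        y = 0.0
        self.partials = [y]                      # partials[i] holds y_t^i, with y_t^0 = 0
        for sigma_i, learner in zip(self.sigma, self.learners):
            combined = (1.0 - sigma_i * self.eta) * y + self.eta * learner.predict(x)
            y = clip(combined, -self.B, self.B)  # projection onto [-B, B] (d = 1)
            self.partials.append(y)
        return y

    def update(self, x, loss_grad):
        self.t += 1
        alpha = 1.0 / (self.L_B * self.B * math.sqrt(self.t))
        for i, learner in enumerate(self.learners):
            y_prev = self.partials[i]            # y_t^{i-1} for the i-th copy
            g = loss_grad(y_prev)
            learner.update(x, g / self.L_B)      # 1-Lipschitz linear loss for the copy
            self.sigma[i] = clip(self.sigma[i] + alpha * g * y_prev, 0.0, 1.0)

random.seed(0)
booster = OnlineGradientBooster(dim=2, n_learners=5, eta=0.5, B=2.0, L_B=8.0)
predictions = []
for _ in range(200):
    x = [random.uniform(-0.7, 0.7), random.uniform(-0.7, 0.7)]
    target = 1.2 * x[0] - 1.2 * x[1]             # outside the base class, inside its span
    predictions.append(booster.predict(x))
    booster.update(x, lambda y: 2.0 * (y - target))   # gradient of (y - target)^2
```

The sketch keeps every partial prediction inside $[-B, B]$ and every $\sigma$ inside $[0,1]$, which are the two invariants the proof relies on.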

5 Variants of the boosting algorithms 

Our boosting algorithms and their analysis are quite flexible: it is easy to modify the algorithms
to work with a different (and perhaps more natural) kind of base learner which does greedy fitting, 
or incorporate a scaling of the base functions which improves performance. Also, when specialized to 
the batch setting, our algorithms provide better convergence rates than previous work. 


5.1 Fitting to actual loss functions 


The choice of an online linear learning algorithm over the base function class in our algorithms was
made to ease the analysis. In practice, it is more common to have an online algorithm that produces
predictions with comparable accuracy to the best function in hindsight for the actual sequence of
loss functions. In particular, a common heuristic in boosting algorithms such as the original gradient
boosting algorithm by Friedman [10] or the matching pursuit algorithm of Mallat and Zhang [18] is to
build a linear combination of base functions by iteratively augmenting the current linear combination
via greedily choosing a base function and a step size for it that minimizes the loss with respect to
the residual label. Indeed, the boosting algorithm of Zhang and Yu [24] also uses this kind of greedy
fitting algorithm as the base learner.

In the online setting, we can model greedy fitting as follows. We first fix a step size $\alpha > 0$ in
advance. Then, in each round $t$, the base learner $\mathcal{A}$ receives not only the example $x_t$, but also an
offset $y'_t \in \mathbb{R}^d$ for the prediction, and produces a prediction $\mathcal{A}(x_t) \in \mathbb{R}^d$, after which it receives the
loss function $\ell_t$ and suffers loss $\ell_t(y'_t + \alpha\mathcal{A}(x_t))$. The predictions of $\mathcal{A}$ satisfy

$$\sum_{t=1}^T \ell_t(y'_t + \alpha\mathcal{A}(x_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^T \ell_t(y'_t + \alpha f(x_t)) + R(T),$$

where $R$ is the regret. Our algorithms can be made to work with this kind of base learner as well.
The details can be found in Section A.1 of the supplementary material.


5.2 Improving the regret bound via scaling 

Given an online linear learning algorithm $\mathcal{A}$ over the function class $\mathcal{F}$ with regret $R$, for any
scaling parameter $\lambda > 0$ we trivially obtain an online linear learning algorithm, denoted $\lambda\mathcal{A}$, over
a $\lambda$-scaling of $\mathcal{F}$, viz. $\lambda\mathcal{F} := \{\lambda f \mid f \in \mathcal{F}\}$, simply by multiplying the predictions of $\mathcal{A}$ by $\lambda$. The
corresponding regret scales by $\lambda$ as well, i.e. it becomes $\lambda R$.

The performance of Algorithm 1 can be improved by using such an online linear learning algorithm
over $\lambda\mathcal{F}$ for a suitably chosen scaling $\lambda > 1$ of the function class $\mathcal{F}$. The regret bound from Theorem 1
improves because the 1-norm of $f$ measured with respect to $\lambda\mathcal{F}$, i.e. $\|f\|'_1 = \max\{1, \|f\|_1/\lambda\}$, is smaller
than $\|f\|_1$, but degrades because the parameter $B' = \min\{\eta N\lambda D, \inf\{b \ge \lambda D : \eta\beta_b b^2 \ge \epsilon_b\lambda D\}\}$ is
larger than $B$. But, as detailed in Section A.2 of the supplementary material, in many situations the
improvement due to the former compensates for the degradation due to the latter, and overall we can
get improved regret bounds using a suitable value of $\lambda$.
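The scaled learner $\lambda\mathcal{A}$ is just a thin wrapper around $\mathcal{A}$. The sketch below is illustrative only: `ToyLearner` is a made-up scalar base learner, and the `predict`/`update` interface is an assumption, not an API from the paper. The wrapper multiplies predictions by $\lambda$ and forwards the linear loss unchanged, so the loss suffered on the scaled prediction is $\lambda$ times the loss of the wrapped learner, which is why the regret scales by $\lambda$.

```python
class ToyLearner:
    """Hypothetical scalar base learner x -> w * x with |w| <= 1."""
    def __init__(self):
        self.w = 0.5

    def predict(self, x):
        return self.w * x

    def update(self, x, c):
        # Gradient step for the linear loss y -> c * y, then clip w to [-1, 1].
        self.w = max(-1.0, min(1.0, self.w - 0.1 * c * x))

class ScaledLearner:
    """Wraps a base learner A to obtain lambda * A over the scaled class."""
    def __init__(self, base, lam):
        self.base = base
        self.lam = lam

    def predict(self, x):
        return self.lam * self.base.predict(x)   # multiply predictions by lambda

    def update(self, x, c):
        self.base.update(x, c)                   # the linear loss is passed through as is

scaled = ScaledLearner(ToyLearner(), lam=3.0)
first = scaled.predict(2.0)      # 3.0 * (0.5 * 2.0)
scaled.update(2.0, 1.0)          # the wrapped weight moves from 0.5 to 0.3
second = scaled.predict(2.0)     # 3.0 * (0.3 * 2.0)
```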
5.3 Improvements for batch boosting


Our algorithmic technique can be easily specialized and modified to the standard batch setting with
a fixed batch of training examples and a base learning algorithm operating over the batch, exactly
as in [24]. The main difference compared to the algorithm of [24] is the use of the $\sigma$ variables to
scale the coefficients of the weak hypotheses appropriately. While a seemingly innocuous tweak, this
allows us to derive analogous bounds to those of Zhang and Yu [24] on the optimization error that
show that our boosting algorithm converges exponentially faster. A detailed comparison can be found
in Section A.3 of the supplementary material.


6 Experimental Results 

Is it possible to boost in an online fashion in practice with real base learners? To study this question,
we implemented and evaluated Algorithms 1 and 2 within the Vowpal Wabbit (VW) open source
machine learning system [23]. The three online base learners used were VW’s default linear learner
(a variant of stochastic gradient descent), two-layer sigmoidal neural networks with 10 hidden units,
and regression stumps.

Regression stumps were implemented by doing stochastic gradient descent on each individual 
feature, and predicting with the best-performing non-zero valued feature in the current example. 

All experiments were done on a collection of 14 publicly available regression and classification
datasets (described in Section B in the supplementary material) using squared loss. The only parameters
tuned were the learning rate and the number of weak learners, as well as the step size parameter
for Algorithm 1. Parameters were tuned based on progressive validation loss on half of the dataset;
reported is progressive validation loss on the remaining half. Progressive validation is a standard online
validation technique, where each training example is used for testing before it is used for updating
the model.
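Progressive validation as described above (test on each example first, then train on it) can be sketched in a few lines. This is a generic illustration, not the VW implementation: the `predict`/`update` interface and the `RunningMean` model are assumptions made for the example.

```python
def progressive_validation(model, examples, loss):
    """Average loss where each example is scored before the model is updated on it."""
    total, count = 0.0, 0
    for x, y in examples:
        y_hat = model.predict(x)     # test first ...
        total += loss(y_hat, y)
        model.update(x, y)           # ... then train on the same example
        count += 1
    return total / count if count else 0.0

class RunningMean:
    """Hypothetical toy model that predicts the mean of the labels seen so far."""
    def __init__(self):
        self.mean, self.n = 0.0, 0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

examples = [(None, 1.0), (None, 3.0), (None, 2.0)]
pv_loss = progressive_validation(RunningMean(), examples, lambda p, y: (p - y) ** 2)
```

In this toy run the model predicts 0, then 1, then 2, so the squared losses are 1, 4 and 0 and the progressive validation loss is 5/3.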

The following table reports the average and the median, over the datasets, relative improvement
in squared loss over the respective base learner. Detailed results can be found in Section B in the
supplementary material.


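| Base learner | Average relative improvement, Algorithm 1 | Average relative improvement, Algorithm 2 | Median relative improvement, Algorithm 1 | Median relative improvement, Algorithm 2 |
|---|---|---|---|---|
| SGD | 1.65% | 1.33% | 0.03% | 0.29% |
| Regression stumps | 20.22% | 15.9% | 10.45% | 13.69% |
| Neural networks | 7.88% | 0.72% | 0.72% | 0.33% |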


Note that both SGD (stochastic gradient descent) and neural networks are already very strong
learners. Naturally, boosting is much more effective for regression stumps, which are weak base
learners.


7 Conclusions and Future Work 

In this paper we generalized the theory of boosting for regression problems to the online setting
and provided online boosting algorithms with theoretical convergence guarantees. Our algorithmic
technique also improves convergence guarantees for batch boosting algorithms. We also provide
experimental evidence that our boosting algorithms do improve prediction accuracy over commonly used
base learners in practice, with greater improvements for weaker base learners. The main remaining
open question is whether the boosting algorithm for competing with the span of the base functions is
optimal in any sense, similar to our proof of optimality for the boosting algorithm for competing
with the convex hull of the base functions.
Supplementary material for “Online Gradient Boosting”


A Variants of the boosting algorithms 

In this section we provide the omitted details of two variants of our boosting algorithms: (a) a
variant that works with a different kind of base learner which does greedy fitting, and (b) a variant
that incorporates a scaling of the base functions to improve performance. We also show how our
algorithmic technique can be used to improve the convergence speed for batch boosting.


A.1 Fitting to actual loss functions


The choice of an online linear learning algorithm over the base function class in our algorithms was
made to ease the analysis. In practice, it is more common to have an online algorithm that produces
predictions with comparable accuracy to the best function in hindsight for the actual sequence of
loss functions. In particular, a common heuristic in boosting algorithms such as the original gradient
boosting algorithm by Friedman [10] or the matching pursuit algorithm of Mallat and Zhang [18] is to
build a linear combination of base functions by iteratively augmenting the current linear combination
by greedily choosing a base function and a step size for it that minimizes the loss with respect to
the residual label. Indeed, the boosting algorithm of Zhang and Yu [24] also uses this kind of greedy
fitting algorithm as the base learner.

In the online setting, we can model greedy fitting as follows. We first fix a step size $\alpha > 0$ in
advance. Then, in each round $t$, the base learner $\mathcal{A}$ receives not only the example $x_t$, but also an
offset $y'_t \in \mathbb{R}^d$ for the prediction, and produces a prediction $\mathcal{A}(x_t) \in \mathbb{R}^d$, after which it receives the
loss function $\ell_t$ and suffers loss $\ell_t(y'_t + \alpha\mathcal{A}(x_t))$. The predictions of $\mathcal{A}$ satisfy

$$\sum_{t=1}^T \ell_t(y'_t + \alpha\mathcal{A}(x_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^T \ell_t(y'_t + \alpha f(x_t)) + R(T),$$


where R is the regret. We now describe how our algorithms can be made to work with this kind of 
base learner as well. 

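Assume that for some known parameter $B > 0$, we have $\|y'_t\| \le B$ for all $t$. Let $B' = B + \alpha D$,
and assume that the loss functions $\ell_t$ are $L_{B'}$-Lipschitz and $\beta_{B'}$-smooth on $\mathbb{B}^d(B')$. Then using the
convexity and smoothness of the loss functions, we have $\ell_t(y'_t + \alpha\mathcal{A}(x_t)) \ge \ell_t(y'_t) + \alpha\nabla\ell_t(y'_t) \cdot \mathcal{A}(x_t)$
and $\ell_t(y'_t + \alpha f(x_t)) \le \ell_t(y'_t) + \alpha\nabla\ell_t(y'_t) \cdot f(x_t) + \frac{\beta_{B'}\alpha^2}{2}\|f(x_t)\|^2$. Plugging these bounds into the above
regret bound and dividing by $\alpha$ we get, for any $f \in \mathcal{F}$,

$$\sum_{t=1}^T \nabla\ell_t(y'_t) \cdot \mathcal{A}(x_t) \;\le\; \sum_{t=1}^T \left(\nabla\ell_t(y'_t) \cdot f(x_t) + \frac{\beta_{B'}\alpha}{2}\|f(x_t)\|^2\right) + \frac{1}{\alpha}R(T).$$

Since $\|f(x_t)\| \le D$, setting $\alpha = \sqrt{\frac{2R(T)}{\beta_{B'}D^2T}}$ we conclude that

$$\sum_{t=1}^T \nabla\ell_t(y'_t) \cdot \mathcal{A}(x_t) \;\le\; \sum_{t=1}^T \nabla\ell_t(y'_t) \cdot f(x_t) + \sqrt{2\beta_{B'}D^2TR(T)}. \tag{8}$$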

This regret bound is sublinear in $T$ if $R(T)$ is sublinear. We can obtain a better regret bound if we
assume that $R(T)$ scales linearly with $\alpha$: this is a natural assumption since the functions $\ell_t(y'_t + \alpha y)$
are $\alpha L_{B'}$-Lipschitz in the prediction $y$. In this case, the regret bound is $R(T) = \alpha R'(T)$ for some fixed
$R' : \mathbb{N} \to \mathbb{R}_+$ independent of $\alpha$, and we can choose $\alpha = \frac{2R'(T)}{\beta_{B'}D^2T}$ so that

$$\sum_{t=1}^T \nabla\ell_t(y'_t) \cdot \mathcal{A}(x_t) \;\le\; \sum_{t=1}^T \nabla\ell_t(y'_t) \cdot f(x_t) + 2R'(T). \tag{9}$$
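The two step-size choices can be checked numerically. The sketch below uses made-up values for $\beta_{B'}$, $D$, $T$ and a hypothetical regret $R(T) = \sqrt{T}$; it verifies that the first choice of $\alpha$ balances the two $\alpha$-dependent terms and yields the additive term in (8), and that the second choice yields the $2R'(T)$ term in (9).

```python
import math

def additive_term(alpha, beta, D, T, R):
    """The alpha-dependent part of the bound: (beta * alpha / 2) * D^2 * T + R / alpha."""
    return 0.5 * beta * alpha * D ** 2 * T + R / alpha

beta, D, T = 2.0, 1.0, 10_000      # hypothetical smoothness, diameter and horizon
R = math.sqrt(T)                   # hypothetical regret R(T) = sqrt(T)

# Step size leading to (8).
alpha_8 = math.sqrt(2.0 * R / (beta * D ** 2 * T))
value_8 = additive_term(alpha_8, beta, D, T, R)
closed_form_8 = math.sqrt(2.0 * beta * D ** 2 * T * R)

# Step size leading to (9), assuming R(T) = alpha * R'(T).
R_prime = math.sqrt(T)             # hypothetical R'(T)
alpha_9 = 2.0 * R_prime / (beta * D ** 2 * T)
value_9 = 0.5 * beta * alpha_9 * D ** 2 * T + (alpha_9 * R_prime) / alpha_9
```

With these numbers `alpha_8` is 0.1 and both `value_8` and `closed_form_8` equal 2000, while `value_9` equals `2 * R_prime`.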
Either the bound (8) or (9) suffices for the analysis of our boosting algorithms to go through:
to use this kind of base learner $\mathcal{A}$, we again keep $N$ copies $\mathcal{A}^1, \mathcal{A}^2, \ldots, \mathcal{A}^N$ with a suitably chosen
step size $\alpha$, and simply change line 11 of Algorithm 2 and line 13 of Algorithm 1 to pass the offset
$y'_t = y_t^{i-1}$ to $\mathcal{A}^i$.

A.2 Improving the regret bound via scaling 

Given an online linear learning algorithm $\mathcal{A}$ over the function class $\mathcal{F}$ with regret $R$, for any
scaling parameter $\lambda > 0$ we trivially obtain an online linear learning algorithm, denoted $\lambda\mathcal{A}$, over
a $\lambda$-scaling of $\mathcal{F}$, viz. $\lambda\mathcal{F} := \{\lambda f \mid f \in \mathcal{F}\}$, simply by multiplying the predictions of $\mathcal{A}$ by $\lambda$. The
corresponding regret scales by $\lambda$ as well, i.e. it becomes $\lambda R$.

The performance of Algorithm 1 can be improved by using such an online linear learning algorithm
over $\lambda\mathcal{F}$ for a suitably chosen scaling $\lambda > 1$ of the function class $\mathcal{F}$. Let $\|f\|'_1 = \max\{1, \|f\|_1/\lambda\}$ be the
1-norm of $f$ measured with respect to $\lambda\mathcal{F}$, and $B' = \min\{\eta N\lambda D, \inf\{b \ge \lambda D : \eta\beta_b b^2 \ge \epsilon_b\lambda D\}\}$.
Then we immediately get the following corollary of Theorem 1.

Corollary 2. For any $f \in \mathrm{span}(\mathcal{F})$, let $\Delta_0 = \sum_{t=1}^T \ell_t(0) - \ell_t(f(x_t))$. Algorithm 1, using $\lambda\mathcal{A}$ as the
online linear algorithm over $\lambda\mathcal{F}$, is an online learning algorithm for $\mathrm{span}(\mathcal{F})$ for losses in $\mathcal{C}$ with the
following regret bound for any $f \in \mathrm{span}(\mathcal{F})$:

$$R'_f(T) \;\le\; \left(1 - \frac{\eta}{\|f\|'_1}\right)^N \Delta_0 + 3\eta\beta_{B'}B'^2\|f\|'_1 T + L_{B'}\|f\|'_1\lambda R(T) + 2L_{B'}B'\|f\|'_1\sqrt{T}.$$

Choosing large values of $\lambda$ implies that $\|f\|'_1$ can be significantly smaller than $\|f\|_1$. But $B'$
becomes bigger than $B$, and correspondingly, the parameters $\beta_{B'}$ and $L_{B'}$ become bigger than $\beta_B$
and $L_B$ respectively. Also, the (lower order) dependence on the regret term $R(T)$ increases by a factor
of $\lambda$.

However, it turns out (see Section m that in several common applications of the algorithm, B' 
can be set to be equal to $B$ or the increase from $B$ is a very slow growing function of $\lambda$, such as $\log(\lambda)$.
In such situations choosing larger values of $\lambda$ leads to improvement in the higher order terms of the
regret bound, while making the lower order term (i.e. $L_{B'}\|f\|'_1\lambda R(T)$) worse; overall the regret bound
can be improved by choosing a suitably large scaling factor $\lambda$ to balance between the two.


A.3 Improvements for batch boosting 


Our algorithmic technique can be used to improve convergence speed for batch boosting as well,
which is the setup considered by Zhang and Yu [24]. Since the focus of this paper is on online boosting, we
give a high level comparison of the bounds here, making some simplifying assumptions to ease the
technical details, using our notation as much as possible.

In the setup of Zhang and Yu [24], we have a base set of real valued functions $\mathcal{F}$, which we assume
is symmetric and contains the zero function, $0$. Then $\mathrm{span}(\mathcal{F})$ is a linear function space, and let $\|\cdot\|$
be some norm defined on $\mathrm{span}(\mathcal{F})$. For clarity of presentation, we assume that for any $f \in \mathcal{F}$, we
have $\|f\| \le 1$. This implies that for any $f \in \mathrm{span}(\mathcal{F})$, we have $\|f\| \le \|f\|_1$.

The goal is to minimize a given convex functional $\ell : \mathrm{span}(\mathcal{F}) \to \mathbb{R}$ over its domain, $\mathrm{span}(\mathcal{F})$. We
assume, for simplicity, that $\ell$ is $\beta$-smooth over $\mathrm{span}(\mathcal{F})$ under the norm $\|\cdot\|$, i.e. for any $f, f' \in \mathrm{span}(\mathcal{F})$,
we have

$$\ell(f') \;\le\; \ell(f) + \nabla\ell(f) \cdot (f' - f) + \frac{\beta}{2}\|f - f'\|^2.$$


Zhang and Yu [24] assume⁴ that we have access to a base learning algorithm $\mathcal{A}$ that, given any
$f \in \mathrm{span}(\mathcal{F})$ and a step size $\eta > 0$, can find a function $g \in \mathcal{F}$ that minimizes $\ell(f + \eta g)$. We denote
the output of $\mathcal{A}$ by $\mathcal{A}(f, \eta)$.

⁴ This is a slight simplification of the base learning algorithm considered in [24], which also performs a search over
the step size $\eta$. Also, the analysis in [24] allows some optimization error for the base learning algorithm; to simplify the
comparison we assume this error is 0.

Given such a base learning algorithm, and a sequence of step sizes $\eta_1, \eta_2, \ldots$, the boosting algorithm
of Zhang and Yu [24] computes a sequence of functions $f_0, f_1, f_2, \ldots \in \mathrm{span}(\mathcal{F})$ via greedy fitting as
follows: $f_0$ is set to $0$, and for any $i \ge 1$,

$$f_i := f_{i-1} + \eta_i\mathcal{A}(f_{i-1}, \eta_i).$$

Define $s_0 = 1$ and $s_i = s_{i-1} + \eta_i$ for any $i \ge 1$.

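For any $f \in \mathrm{span}(\mathcal{F})$, for $i = 1, 2, \ldots$, let $\Delta_i = \ell(f_i) - \ell(f)$ denote the optimization errors of the
function $f_i$. Zhang and Yu [24] prove that for any $N \in \mathbb{N}$, we have

$$\Delta_N \;\le\; \frac{s_0 + \|f\|_1}{s_N + \|f\|_1}\Delta_0 + \sum_{i=1}^N \frac{s_i + \|f\|_1}{s_N + \|f\|_1} \cdot \frac{\beta\eta_i^2}{2}. \tag{10}$$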


Using the techniques in this paper, we can define a different boosting algorithm which works as
follows. Given the same sequence of step sizes $\eta_1, \eta_2, \ldots$ as above, we set $f_0 = 0$, and for any $i \ge 1$,

$$f_i := (1 - \sigma_i\eta_i)f_{i-1} + \eta_i\mathcal{A}(f_{i-1}, \eta_i),$$

where

$$\sigma_i := \begin{cases} 1 & \text{if } \nabla\ell(f_{i-1}) \cdot f_{i-1} > 0 \\ 0 & \text{otherwise.} \end{cases}$$
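The modified batch update can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the authors' code: functions are represented by their values on a finite training sample, the functional is the halved mean squared error (which is 1-smooth under the root-mean-square norm), and the base learner $\mathcal{A}(f, \eta)$ is an exhaustive search over a small, hand-picked, symmetric base class.

```python
def batch_boost(base_functions, targets, step_sizes):
    """Batch boosting with the sigma-scaled update f_i = (1 - sigma_i*eta_i) f_{i-1} + eta_i g_i."""
    m = len(targets)

    def loss(f):
        # Halved mean squared error over the sample.
        return sum((fj - yj) ** 2 for fj, yj in zip(f, targets)) / (2 * m)

    def grad_dot(f, h):
        # Directional derivative of the loss at f in the direction h.
        return sum((fj - yj) * hj for fj, yj, hj in zip(f, targets, h)) / m

    f = [0.0] * m
    for eta in step_sizes:
        # Base learner A(f, eta): exhaustive search for g minimizing loss(f + eta * g).
        g = min(base_functions,
                key=lambda h: loss([fj + eta * hj for fj, hj in zip(f, h)]))
        sigma = 1.0 if grad_dot(f, f) > 0 else 0.0
        f = [(1.0 - sigma * eta) * fj + eta * gj for fj, gj in zip(f, g)]
    return f, loss(f)

targets = [1.0, -1.0]
base = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [0.0, 0.0]]  # symmetric, contains 0
final_f, final_loss = batch_boost(base, targets, step_sizes=[0.5] * 4)
```

In this particular run the iterate never overshoots the target, so $\sigma_i$ stays 0 throughout and the iterates are $(0.5, 0)$, $(0.5, -0.5)$, $(1, -0.5)$, $(1, -1)$; the shrinkage only becomes active when $\nabla\ell(f_{i-1}) \cdot f_{i-1} > 0$.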


We can analyze this algorithm along the lines of the proof of Theorem 1. First, let $g_i = \mathcal{A}(f_{i-1}, \eta_i)$.
Then for any $g \in \mathcal{F}$, we have $\ell(f_{i-1} + \eta_i g_i) \le \ell(f_{i-1} + \eta_i g)$, and by the convexity and $\beta$-smoothness of $\ell$,
we conclude that

$$\nabla\ell(f_{i-1}) \cdot g_i \;\le\; \nabla\ell(f_{i-1}) \cdot g + \frac{\beta}{2}\eta_i.$$

Using this fact and following the proof of Theorem 1, we get the following bound on the optimization
error $\Delta_N = \ell(f_N) - \ell(f)$ of the function $f_N$:


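$$\Delta_N \;\le\; \exp\left(-\frac{s_N - s_0}{\|f\|_1}\right)\Delta_0 + \sum_{i=1}^N \exp\left(-\frac{s_N - s_i}{\|f\|_1}\right) \cdot \frac{\beta\eta_i^2(s_{i-1}^2 + 1)}{2}. \tag{11}$$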


We can compare our bound (11) to the bound (10) of Zhang and Yu [24] by comparing
corresponding terms in the bounds. For each term, we can calculate how large $s_N$ needs to be for
the term to be reduced to less than some given bound $\epsilon$. To reduce the first term to less than $\epsilon$,
our algorithm needs $s_N \ge \|f\|_1\log(\frac{\Delta_0}{\epsilon}) + s_0$, whereas the algorithm of Zhang and Yu [24] needs
$s_N \ge (\frac{\Delta_0}{\epsilon})(s_0 + \|f\|_1) - \|f\|_1$. As for the second term, to reduce the $i$-th term in the sum to less
than $\epsilon$, our algorithm needs $s_N \ge \|f\|_1\log(\frac{\beta\eta_i^2(s_{i-1}^2 + 1)}{2\epsilon}) + s_i$, whereas the algorithm of Zhang and Yu
[24] needs $s_N \ge (\frac{\beta\eta_i^2}{2\epsilon})(s_i + \|f\|_1) - \|f\|_1$. Since in either case, the dependence on $\epsilon$ is $\log(\frac{1}{\epsilon})$ for our
algorithm, whereas it is $\frac{1}{\epsilon}$ for the algorithm of Zhang and Yu [24], we conclude that our algorithm
converges exponentially faster.
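To make the gap concrete, the two requirements on $s_N$ for the first term can be evaluated for sample values. The numbers below ($\Delta_0 = 1$, $\|f\|_1 = 2$, $\epsilon = 10^{-3}$) are arbitrary illustrations, and the function names are ours.

```python
import math

def s_needed_ours(delta0, eps, f_norm1, s0=1.0):
    """s_N needed so that exp(-(s_N - s0) / ||f||_1) * delta0 <= eps."""
    return f_norm1 * math.log(delta0 / eps) + s0

def s_needed_zhang_yu(delta0, eps, f_norm1, s0=1.0):
    """s_N needed so that (s0 + ||f||_1) / (s_N + ||f||_1) * delta0 <= eps."""
    return (delta0 / eps) * (s0 + f_norm1) - f_norm1

ours = s_needed_ours(delta0=1.0, eps=1e-3, f_norm1=2.0)
theirs = s_needed_zhang_yu(delta0=1.0, eps=1e-3, f_norm1=2.0)
```

For these values the first requirement is about 14.8 while the second is 2998, reflecting the $\log(1/\epsilon)$ versus $1/\epsilon$ dependence.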


B Description of Data Sets and Detailed Experimental Results

The datasets come from the UCI repository and various KDD Cup challenges. Below, d is the number 
of unique features in the dataset, and s is the average number of features per example. 
| Dataset | Number of instances | Total number of features | Average number of features per example | Task | Label range |
|---|---|---|---|---|---|
| a9a | 48,841 | 123 | 14 | classification | |
| abalone | 4,177 | 10 | 9 | regression | [1,29] |
| activity | 165,632 | 20 | 18.5 | classification | [-1,1] |
| adult | 48,842 | 105 | 12 | classification | [0,1] |
| bank | 45,211 | 45 | 15 | classification | [-1,1] |
| cal_housing | 20,640 | 9 | 9 | regression | [0,1] |
| casp | 45,730 | 10 | 10 | regression | [0,1] |
| census | 299,284 | 401 | 32 | classification | [-1,1] |
| covtype | 581,011 | 54 | 12 | classification | [-1,1] |
| kddcup04 (phy) | 50,000 | 74 | 32 | classification | [0,1] |
| letter | 20,000 | 16 | 15.6 | classification | [-1,1] |
| shuttle | 43,500 | 9 | 8 | classification | [-1,1] |
| slice | 53,500 | 385 | 135 | regression | [0,1] |
| year | 463,715 | 90 | 90 | regression | [0,1] |


The following table provides the online squared losses summarized in Section 6.


| Dataset | SGD: Baseline | SGD: Alg. 1 | SGD: Alg. 2 | Regression stumps: Baseline | Regression stumps: Alg. 1 | Regression stumps: Alg. 2 | Neural networks: Baseline | Neural networks: Alg. 1 | Neural networks: Alg. 2 |
|---|---|---|---|---|---|---|---|---|---|
| kddcup04/phy | 0.7475 | 0.7466 | 0.7470 | 0.9201 | 0.7733 | 0.7924 | 0.7441 | 0.7480 | 0.7446 |
| cal_housing | 0.0094 | 0.0094 | 0.0104 | 0.0151 | 0.0138 | 0.0124 | 0.0096 | 0.0096 | 0.0107 |
| casp | 0.0632 | 0.0631 | 0.0630 | 0.0741 | 0.0741 | 0.0742 | 0.0639 | 0.0632 | 0.0631 |
| a9a | 0.4261 | 0.4283 | 0.4249 | 0.5749 | 0.5074 | 0.5758 | 0.4256 | 0.4266 | 0.4246 |
| abalone | 3.7263 | 3.7482 | 3.7154 | 6.7791 | 3.8273 | 4.2270 | 3.7380 | 3.7255 | 3.7212 |
| activity | 0.0334 | 0.0337 | 0.0316 | 0.4492 | 0.1454 | 0.3141 | 0.0192 | 0.0143 | 0.0186 |
| adult | 0.1055 | 0.1057 | 0.1056 | 0.1388 | 0.1261 | 0.1250 | 0.1081 | 0.1062 | 0.1081 |
| bank | 0.2971 | 0.2968 | 0.2973 | 0.3774 | 0.3240 | 0.3257 | 0.2962 | 0.2969 | 0.2969 |
| census | 0.1544 | 0.1545 | 0.1553 | 0.2073 | 0.1884 | 0.1789 | 0.1531 | 0.1531 | 0.1523 |
| covtype | 0.7256 | 0.7270 | 0.7286 | 0.7910 | 0.7986 | 0.7911 | 0.6807 | 0.6465 | 0.6757 |
| letter | 0.6441 | 0.5698 | 0.6108 | 0.7420 | 0.7087 | 0.7168 | 0.6542 | 0.5729 | 0.6108 |
| shuttle | 0.1616 | 0.1547 | 0.1577 | 0.8551 | 0.3678 | 0.4354 | 0.0760 | 0.0694 | 0.0802 |
| slice | 0.0076 | 0.0067 | 0.0065 | 0.0559 | 0.0362 | 0.0410 | 0.0054 | 0.0022 | 0.0044 |
| year | 0.0116 | 0.0119 | 0.0115 | 0.0152 | 0.0140 | 0.0141 | 0.0116 | 0.0119 | 0.0122 |
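The summary statistics of Section 6 can be approximately recomputed from this table. The sketch below assumes that "relative improvement" means (baseline loss − boosted loss) / baseline loss, which the text does not state explicitly, and uses the regression stumps baseline and Algorithm 1 columns as transcribed above. Because the table entries are rounded to four decimals, the recomputed values only match the reported 20.22% average and 10.45% median approximately.

```python
from statistics import mean, median

# (baseline loss, Algorithm 1 loss) for regression stumps, copied from the table above.
stumps = {
    "kddcup04/phy": (0.9201, 0.7733),
    "cal_housing": (0.0151, 0.0138),
    "casp": (0.0741, 0.0741),
    "a9a": (0.5749, 0.5074),
    "abalone": (6.7791, 3.8273),
    "activity": (0.4492, 0.1454),
    "adult": (0.1388, 0.1261),
    "bank": (0.3774, 0.3240),
    "census": (0.2073, 0.1884),
    "covtype": (0.7910, 0.7986),
    "letter": (0.7420, 0.7087),
    "shuttle": (0.8551, 0.3678),
    "slice": (0.0559, 0.0362),
    "year": (0.0152, 0.0140),
}

# Assumed definition: relative improvement = (baseline - boosted) / baseline.
relative = [(base - boosted) / base for base, boosted in stumps.values()]
avg_rel = mean(relative)
med_rel = median(relative)
```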